Introduction

Motivation

barcodetrackR is an R package developed for the analysis and visualization of clonal tracking data from cellular barcoding experiments.

Installation

Currently, barcodetrackR is available at Github and can be downloaded using the devtools package.

Contributors

The R package and functions were created by Diego A. Espinoza, Ryland Mortlock, Samson J. Koelle, and others at Cynthia Dunbar’s laboratory at the National Heart, Lung, and Blood Institutes of Health. Issues should be addressed to https://github.com/d93espinoza/barcodetrackR/issues.

Loading data

Loading required packages

The barcodetrackR package operates on SummarizedExperiment objects from the Bioconductor repository. It stores associated colData for each sample in this object as well as any metadata. We load the barcodetrackR and SummarizedExperiment packages here for our analyses, as well as the magrittr package in order to improve legibility of code through using the pipe %>% operator.

Creating objects with create_SE

For this vignette, we will load publically available data from the following papers (these sample datasets are included in the R package):

In addition to reads data, the Six et al. paper includes “estimated abundance” data for each of these timepoints. We load these into a new SE in order to compare our analyses downstream.

Our input dataframes to create the SummarizedExperiment (SE) objects are each an n x m data.frame where there are n rows of observations (typically cellular barcodes, insertion sites, or the like) and the m columns are the samples. The input metadata must have row order identical to the order of the colums in its corresponding dataframe. The metadata must also have a column titled SAMPLENAME that denotes the column of your_data it refers to.

Assays created by create_SE

create_SE takes the input dataframe and metadata and creates a SummarizedExperiment object with the following assays:

  • counts: the raw values from the input dataframe
  • percentages: the per-column proportions of each entry in each column
  • ranks: the rank of each entry in each column
  • normalized: the normalized read values (CPM)
  • logs: the log of the normalized values
## List of length 5
## names(5): counts percentages ranks normalized logs

Correlations

scatter_plot

A straightforward way to view the relationship between samples in a pairwise manner is to view basic scatter plots of two samples using the provided assays. We provide a scatter_plot function as part of the package.

Here, we view the correlation of barcode counts between different cell types at the 20 month timepoint of the Wu et al study. We compare granulocytes (Gr) to B and T cells.

cor_plot

A more comprehensive way to view the relationship between samples in a pairwise manner is to use a correlation plot. Here, we view the Pearson correlation between the T, B, Gr, NK CD56+/CD16-, and NK CD56-/CD16+ fractions within the Wu dataset for the 6, 9.5, 12, and 20 month post-transplant timepoints.

We can also use the spearman correlation as our metric of interest, and visualize our correlation plot using circles with areas proportional to the absolute value of the correlation and colors corresponding to the value of the correlations.

We can return a table of the Pearson correlations as well as the p-values and confidence intervals for each of the comparisons above.

##    sample_i    sample_j correlation_value       p_value     ci_lo     ci_hi
## 1 ZJ31_6m_T   ZJ31_6m_T         1.0000000  0.000000e+00 1.0000000 1.0000000
## 2 ZJ31_6m_T ZJ31_9.5m_T         0.4672994  0.000000e+00 0.4527245 0.4816246
## 3 ZJ31_6m_T  ZJ31_12m_T         0.4406121  0.000000e+00 0.4261147 0.4548832
## 4 ZJ31_6m_T  ZJ31_20m_T         0.3730234  0.000000e+00 0.3564373 0.3893744
## 5 ZJ31_6m_T   ZJ31_6m_B         0.2263901 7.739959e-201 0.2122372 0.2404479
## 6 ZJ31_6m_T ZJ31_9.5m_B         0.2346975 2.001192e-191 0.2197056 0.2495785

Above, we used two of the three available correlation visualizations ("color" and "circle") using the standard color palette provided. A no_negative parameter is offered as well to eliminate negative correlations within the data, from which deriving biological meaning may be difficult.

When there are a smaller number of samples to analyze, the "number" option can be used to view the actual correlation within the grid. Here is an example comparing the pearson correlation of clones within peripheral blood at subsequent timepoints from the belberdos sample data set.

Clonal patterns

barcode_ggheatmap

A useful visualization to study clonal patterns over time is by using a heat map which clusters the top clones based on relatedness and displays their proportion in each sample over time. Our function barcode_ggheatmap does this by choosing the top N clones (n_clones) within each sample and tracking them over time. The argument n_clones assumes that in most cases, the large-contributing clones are of most interest to the user. This assumption can be relaxed by passing a large value to the argument.

We first visualize the top 10 clones from the selected samples in the Wu dataset. The default cell note is stars for the top 10 clones in each sample.

We can also visualize the percentage contribution of each of these clones and add a dendogram which clusters clones based on the Euclidean distance between each clone’s log assay. Here we plot the top 5 clones within the Six dataset. First, we order the columns to group them by celltype.

Splitting the clones into 4 clusters makes it easy to identify clones which are biased towards Monocytes (purple), T cells (green), NK cells (light blue), and B cell / Granulocyte (red).

The Belderbros dataset generally contains less clones than either the Wu or Six datasets, likely because the data is from mice. When possible, using the “percentages” option of the cellnote_assay aqrgument provides maximal information.

Compared to the other data sets, this one generally shows a smaller number of clones making up a large proportion of the hematopoiesis.

barcode_binary_heatmap

In some cases, we may be interested in a global view of the presence or absence of barcodes across samples, regardless of read abundance. In that case, a binary heat map can be generated using barcode_binary_heatmap to give the simplest visual representation. Here we view the binary heat map of the belderbos data with a threshold of 0.01, meaning clones that make up less than 1% of a sample are treated as not detected.

clonal_contribution

Another familiar way to visualize clonal patterns over time is using a line or bar chart showing the proportion of top clones. In the above heat maps for wu data, we could see that there are some large uni-lineage CD56-/CD16+ NK cell clones. We can view the expansion of the top clones from the final timepoint through a stacked area line chart showing the proportion of each clone in CD56-/CD16+ NK cell samples across time.

The colored clones represent the top 10 clones in the 20 month CD56-/CD16+ NK sample. The gray clones are other smaller clones which are found in the CD16+ samples at any of the four timepoints.

For data with fewer clones, a bar chart might be appropriate. We can view the peripheral blood samples from the belderbos dataset, previously shown as a heat map. By turning the plot_non_selected argument to false, we can only look at the top 10 clones from the chosen sample and ignore. We can also use categorical spacing on the x-axis rather than numeric by setting keep_numeric to false.

Clonal Bias

ridge_plot

The ridge plot function calculates the density of clones at different values of bias between two selections. Note that log_bias is calculated as log(selection_1/selection_2 + 1) so that it can handle clones which are zero in one of the samples. If the "plot_by" argument is provided, the ridge plot will have a y axis corresponding to samples along the variable provided which must be a column of colData.

Here, we view a ridge plot showing clonal bias between B and T cells from the Wu dataset for the 6, 9.5, 12, and 20 month post-transplant timepoints.

By default, the ridge plot calculates the density of clones at each value of the log bias. It does not consider the fractional contribution of the clones. If you want larger clones to be more influential to the density estimation, set weighted = TRUE to make a weighted heat map.

For this dataset, the weighted heat map shows a more balanced (less biased) clonality between B and T cells as compared to the regular ridge plot. This means that although there are highly biased clones, the larger clones tend to be shared between B and T cells.

Creating a ridge plot based on the Six et al data also shows that the clonality between B and T cells (as assessed by viral integration site analysis) becomes less biased over time.

bias_histogram

bias_distribution_barplot

ternary_plot

Diversity

rank_abundance

A way to depict clone richness and evenness within a plot is by using a rank-abundance plot. Here, the cumulative abundance of every clone in a sample is plotted in descending rank from 1 to n (where there are n clones in the sample being plotted), or by scaling all ranks to the range [0,1].

We plot all samples from the Belderbos dataset. Note that the week 10 and 14 samples appear to have the most evenness across detected clones, while the other samples contain both large and small detected clones.

clonal_diversity

Within-sample diversity indices (also referred to as alpha diversities) are indices computed independently for each sample in a data set. With clonal tracking, these diversity indices can give a global indication about the number of species in a sample using the number of detected species as input and sometimes also leveraging the proportional abundances of species within the sample. We include the function clonal_diversity which can calculate three diversity indices (making use the vegan package):

  • "shannon" \(H'=-\sum _{i=1}^{R}p_{i}\ln p_{i}\)
  • "simpson" \(\lambda =\sum _{i=1}^{R}p_{i}^{2}\)
  • "invsimpson"\({\displaystyle {\frac {1}{\lambda }}={1 \over \sum _{i=1}^{R}p_{i}^{2}}={}^{2}D}\)

We also include "count" as an option for index_type in order to use the total detected clones per sample as a measure for diversity.

As an example, we can plot the shannon diversities for the 5 cell types within the Wu dataset over time.

Here we show the simpson indices for the unsorted samples in the Belderbos dataset over time; the last time point splits the cell fraction in to T, B, and Granulocyte fractions, allowing comparison of their simpson indices.

Similar to the Wu dataset, the Six dataset contains clonal tracking information for the T, B, Gr, Monocyte, and NK lineages. We can plot these over time as well and interstingly observe that the shannon diversity of the NK cell fraction follows a similar pattern to that of the Wu dataset.

mds_plot

Measures of simmilarity or dissimilarity between samples are known as beta-diversity indices (or distances if they are metrics). A common way for depicting these beta-diversity indices are using what are known as PCoA (Principal Coordinate Analysis) plots, in which an input distance matrix is plotted in two dimensions. Again, we leverage the "vegan" package here to call vegandist which allows us to calculate a number of dissimilarity indices between our samples (choosing an assay from the SummarizedExperiment object) and then perform principal coordinates analysis using cmdscale. Note that using "euclidean" as our index is equivalent to performing PCA (Prinicipal Components Analysis) on our data.

One of the most commonly used beta-diversity indices is the (Bray-Curtis Dissimilarity)[https://en.wikipedia.org/wiki/Bray%E2%80%93Curtis_dissimilarity]. Here, we find the Bray-Curtis dissimilarity index between all of the samples in the Wu dataset and use PCoA to plot them on two dimensions. From the plot, it is evident that NK cells are most dissimilar from all other celltypes when considering the Bray-Curtis index.

When using the Bray-Curtis dissimilarity index between all of the samples in the Six dataset, we find that similarly to the Wu dataset, NK cells appear dissimilar from the remainder of the celltypes, while Monocytes and Granulocytes appear most similar to one another.

Circos

Utilities